Security News
GitHub Removes Malicious Pull Requests Targeting Open Source Repositories
GitHub removed 27 malicious pull requests attempting to inject harmful code across multiple open source repositories, in another round of low-effort attacks.
got-scraping
Got Scraping is a small but powerful got extension that sends browser-like requests out of the box. This is essential in the web scraping industry, where requests need to blend in with regular website traffic.
$ npm install got-scraping
Note:
- Node.js >=15.10.0 is required due to instability of HTTP/2 support in lower versions.
The Got Scraping package is built using the got.extend(...) functionality, so it supports all the features Got has.
Interested in what's under the hood?
const { gotScraping } = require('got-scraping');

gotScraping
    .get('https://apify.com')
    .then(({ body }) => console.log(body));
proxyUrl
Type: string
URL of the HTTP or HTTPS based proxy. HTTP/2 proxies are supported as well.
const { gotScraping } = require('got-scraping');

gotScraping
    .get({
        url: 'https://apify.com',
        proxyUrl: 'http://username:password@myproxy.com:1234',
    })
    .then(({ body }) => console.log(body));
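Credentials that contain characters such as @ or : need to be percent-encoded before they are placed in a proxy URL. A minimal sketch, assuming a hypothetical helper named buildProxyUrl (it is not part of got-scraping) and placeholder credentials, using only Node's built-in URL class:

```javascript
// Hypothetical helper, not part of got-scraping: builds a proxy URL and
// lets the WHATWG URL setters percent-encode the credentials.
function buildProxyUrl({ protocol = 'http', host, port, username, password }) {
    const url = new URL(`${protocol}://${host}:${port}`);
    url.username = username;
    url.password = password;
    return url.href;
}

const proxyUrl = buildProxyUrl({
    host: 'myproxy.com',
    port: 1234,
    username: 'username',
    password: 'p@ss:word', // placeholder value with reserved characters
});

console.log(proxyUrl);
```

The resulting string can then be passed as the proxyUrl option.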
useHeaderGenerator
Type: boolean
Default: true
Whether to generate browser-like headers.
headerGeneratorOptions
See the HeaderGeneratorOptions docs.
const response = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    headerGeneratorOptions: {
        browsers: [
            {
                name: 'chrome',
                minVersion: 87,
                maxVersion: 89,
            },
        ],
        devices: ['desktop'],
        locales: ['de-DE', 'en-US'],
        operatingSystems: ['windows', 'linux'],
    },
});
sessionToken
A non-primitive unique object which describes the current session. By default, it's undefined, so new headers will be generated every time. Headers generated with the same sessionToken never change.
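The behaviour described for sessionToken can be pictured as a cache keyed by the token object. The following is a conceptual sketch only, not got-scraping's source: generateHeaders is a hypothetical stand-in for a real header generator, and the WeakMap is an assumption chosen because its keys must be objects, which fits the "non-primitive" requirement.

```javascript
// Conceptual sketch only: caches generated headers per session token.
// generateHeaders is a hypothetical stand-in for a real header generator.
const headersBySession = new WeakMap();
let generated = 0;

function generateHeaders() {
    generated += 1;
    return { 'user-agent': `example-agent/${generated}` };
}

function getHeaders(sessionToken) {
    // No token: generate fresh headers for every request.
    if (sessionToken === undefined) {
        return generateHeaders();
    }
    // Same token object: reuse the headers generated the first time.
    if (!headersBySession.has(sessionToken)) {
        headersBySession.set(sessionToken, generateHeaders());
    }
    return headersBySession.get(sessionToken);
}

const sessionA = {};
const sessionB = {};
const first = getHeaders(sessionA);
const second = getHeaders(sessionA);
const other = getHeaders(sessionB);

console.log(first === second); // same token, same headers
console.log(first === other); // different tokens, different headers
```

With the library itself, the equivalent would be to create one object per session and pass that same object as the sessionToken option on every request belonging to the session.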
Thanks to the included header-generator package, you can choose various browsers from different operating systems and devices. It generates all the headers automatically so you can focus on the important stuff instead.
Yet another goal is to simplify the usage of proxies. Just pass the proxyUrl option and you are set. Got Scraping automatically detects the HTTP protocol that the proxy server supports. After the connection is established, it performs another ALPN negotiation with the end server. Once that is complete, Got Scraping can proceed with HTTP requests.
Using the same HTTP version that browsers do is important as well. Most modern browsers use HTTP/2, so Got Scraping uses it too. Fortunately, this is already supported by Got: it automatically handles ALPN protocol negotiation to select the best available protocol.
HTTP/1.1 headers are always automatically formatted in Pascal-Case. However, there is an exception: x- headers are not modified in any way.
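The formatting rule above can be illustrated with a small function. This is a simplified sketch with hypothetical names (toPascalCaseHeader, formatHeaders); the library's own implementation may handle additional special cases.

```javascript
// Simplified sketch of the rule described above; not the library's code.
function toPascalCaseHeader(name) {
    // Headers starting with "x-" are left exactly as given.
    if (name.toLowerCase().startsWith('x-')) {
        return name;
    }
    // Capitalize the first letter of every dash-separated part.
    return name
        .split('-')
        .map((part) => part.charAt(0).toUpperCase() + part.slice(1).toLowerCase())
        .join('-');
}

function formatHeaders(headers) {
    return Object.fromEntries(
        Object.entries(headers).map(([name, value]) => [toPascalCaseHeader(name), value]),
    );
}

const formatted = formatHeaders({
    'user-agent': 'test',
    'accept-language': 'en-US',
    'x-requested-with': 'XMLHttpRequest',
});

console.log(formatted);
```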
By default, Got Scraping uses an insecure HTTP parser, which allows access to websites served by non-spec-compliant web servers.
Last but not least, Got Scraping comes with an updated TLS configuration. Some websites fingerprint it and compare it with that of real browsers. While Node.js doesn't support OpenSSL 3 yet, the current configuration should still work well.
To get more detailed information about the implementation, please refer to the source code.
This package generates only the standard attributes. You might want to add the referer header if necessary. Please bear in mind that these headers are made for GET requests for HTML documents. If you want to make POST requests or GET requests for any other content type, you should alter these headers according to your needs. You can do so by passing a headers option or writing a custom Got handler.
This package should provide a solid start for your browser request emulation process. All websites are built differently, and some of them might require additional care.
const response = await gotScraping({
    url: 'https://apify.com/',
    headers: {
        'user-agent': 'test',
    },
});
For more advanced usage please refer to the Got documentation.
You can parse JSON with this package too, but please bear in mind that the request header generation is done specifically for the HTML content type. You might want to alter the generated headers to match the browser ones.
const response = await gotScraping({
    responseType: 'json',
    url: 'https://api.apify.com/v2/browser-info',
});
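One way to adjust the headers for a JSON endpoint is to build the request options in a small helper. The sketch below assumes a hypothetical jsonRequestOptions function (not part of got-scraping) and an assumed accept value; the right headers depend on what a real browser sends for the endpoint in question.

```javascript
// Hypothetical helper, not part of got-scraping: returns request options
// for a JSON endpoint with an accept header suited to JSON responses.
function jsonRequestOptions(url, extraHeaders = {}) {
    return {
        url,
        responseType: 'json',
        headers: {
            // Assumed value; mirror what a real browser sends for this call.
            accept: 'application/json',
            ...extraHeaders,
        },
    };
}

const options = jsonRequestOptions('https://api.apify.com/v2/browser-info', {
    referer: 'https://apify.com/',
});

console.log(options);
```

The resulting object could then be passed to gotScraping(options) in place of the inline options.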
This section covers possible errors that might happen due to different site implementations.
RequestError: Client network socket disconnected before secure TLS connection was established
The error above can be a result of the server not supporting the provided TLS settings. Try changing the ciphers parameter to either undefined or a custom value.
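The suggestion above could be automated with a single retry. This is only a sketch: requestWithCipherFallback is a hypothetical wrapper, requestFn stands for whatever performs the request (for example gotScraping), and passing ciphers at the top level of the options is an assumption that may not hold for every version of the package.

```javascript
// Sketch of a one-time retry that resets the cipher list. Passing `ciphers`
// at the top level of the options is an assumption, not a confirmed API.
const TLS_DISCONNECT =
    'Client network socket disconnected before secure TLS connection was established';

async function requestWithCipherFallback(requestFn, options) {
    try {
        return await requestFn(options);
    } catch (error) {
        // Only handle the TLS disconnect error; rethrow anything else.
        if (!String(error && error.message).includes(TLS_DISCONNECT)) {
            throw error;
        }
        // Retry once with ciphers set to undefined.
        return requestFn({ ...options, ciphers: undefined });
    }
}
```

A call might look like requestWithCipherFallback(gotScraping, { url: 'https://apify.com' }). If the retry fails as well, a custom ciphers value is the other option mentioned above.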
FAQs
HTTP client made for scraping based on got.
The npm package got-scraping receives a total of 29,815 weekly downloads. As such, got-scraping popularity was classified as popular.
We found that got-scraping demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 0 open source maintainers collaborating on the project.